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Most existing implementations of multiple precision aritltmetic demand 
that the user sets the precision a priori. Some libraries are said adaptable in 
the sense that they dynamically change the precision of each intermediate 
operation individually to deliver the target accuracy according to the ac- 
^— V , tual inputs. We present in this text a new adaptable numeric core inspired 

1^^. ■ both from floating point expansions and from on-line arithmetic. 

("^ ' The numeric core is cut down to four tools. The tool that contains arith- 

metic operations is proved to be correct. The proofs have been formally 
checked by the Coq assistant. Developing the proofs, we have formally 
^ I proved many results published in the literature and we have extended a 

^ . few of them. This work may let users (i) develop application specific adapt- 

able libraries based on the toolset and / or (ii) write new formal proofs 
based on the set of validated facts. 

X' 

H 

P. \ Introduction 

We will first present examples of numerical computations coming from four 
fields of application. With each example we introduce new properties, new 
goals and some needed references. As the subject will become clear, we will 
present the contributions of this text and its summary. 

Motivation 

As an undergraduate student, we have all learnt to compute a quantity with 
a few significant digits first as 5 miles is about 8 kilometers. If the question 
is "Are 4.995 miles more than 8 kilometers?", we are close to a threshold. To 
answer the question, we would compute more (less significant) digits to find 
that 4.995 miles are a little over 38 meters longer than 8 kilometers. We might 
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first do a multiplication of 5 by 1.6 and then the exact multiplication of 4.995 by 
1.609344. However, we could save that 5 times 1.6 is exactly 8 and afterward 
just multiply 5 by 9 to obtain 45 meters with a small additional value that is 
less than 5 meters. That answers the question since 0.005 miles is less than 10 
meters. 

Numerical analysts have long designed algorithms that react to the accu- 
racy aimed by the user. Countless specific algorithms reduce the error on the 
final result by iterative refinement in solving algebraic equations, linear sys- 
tems, ODEs, PDEs and so on [pQ]. Such methods are so robust that the final 
solution can be very accurate even if the first attempt before any refinement is 
not close. 

All the above-mentioned solutions assume that the intermediate arithmetic 
operations are performed up to a fixed level of precision. This level, usually the 
one given by machine double precision arithmetic, is used for all the operations 
and is sufficient to store the result to the accuracy aimed by the user Recent 
work presented in [ pO| , pi] , ^ proposes new enhanced routines if two levels of 
precision are available on the machine. The second level can be obtained by 
a careful use of the tools presented in this article [^ . All these algorithms use 
the machine arithmetic as an error prone tool that must be circumvented by 
mathematical analysis. 

Adaptability versus multiple precision 

A numerical algorithm is only slightly modified to accommodate adaptable 
arithmetic or multiple precision arithmetic. With multiple precision arithmetic, 
each operator is fired only once and the result is computed with the precision 
fixed at compile time. Static error analysis is used to guarantee that the final 
result is known to the accuracy specified by the user Being static, the error 
analysis is often too pessimistic. As it is readily available, multiple precision 
arithmetic is often chosen by users. 

When a code has been modified to become adaptable, every single operator 
is started with a very limited precision. The runtime system freezes the opera- 
tor but it is able to resume its work w^hen necessary. The program is not ended 
unless the final result is known to the accuracy specified by the user. The eval- 
uation continues until this is the case, with the system increasing automatically 
the working precision of each operator independently. 

Coherency is key to the field of computational geometry where one sin- 
gle tiny error may change a convex hull to a non-convex, possibly non-planar, 
graph. Numerical routines are used to answer such predicates as "Is a point 
in a circle defined by 3 other points?". A Boolean answer cannot be approxi- 
mated, it is either right or wrong, and one single wrong answer might lead a 
program to an abstract state that cannot occur in Euclidian geometry. Recent 
works have proposed clever implementations of multiple precision arithmetic 
[E3, B3p to answer correctly such questions. They reinforce the existing set of 
general purpose multiple precision libraries yiQ , Bsl 0, pi E4I . Some marginal 
work does implement iterative refinements of the geometric predicates ]M or 
change the algorithm to work correctly with limited predicates [pi . 

Adaptability is used for our last example. It is the central scheme of Ziv's 
paper [ p7| ] w^here the author implements correctly rounded elementary func- 
tions (circular and exponential). Correct rounding for the floating point num- 



bers is mathematically defined by two international standards (ANSI-IEEE- 
ISO 754 and 854 [M, pR E6l ^]). It is a monotonous projection from the real 
set to the set of the floating point numbers. If all the numbers of a given inter- 
val round to the same floating point value, you can safely round this interval 
to the common result. Otherwise, you cannot round the interval. 

An approximated algorithm such as the ones used for the evaluation of the 
elementary functions PSQ does not produce directly the rounded result but it 
computes an approximated result and an error bound. This pair defines an 
interval that contains the exact result. 

• If the interval is narrow, it rounds to one single floating point value that 
is the correctly rounded value of the exact elementary function. 

• If we discover from the current approximation that the interval contains 
a discriminating point, the approximation must be refined to reduce its 
width. This process continues until a satisfying approximation is ob- 
tained to round the interval to one single value. The set of the discrim- 
inating point is the set of the representable numbers for the directed 
roundings and the set of the midpoints when rounding to the nearest. 

With the first iteration of Ziv's scheme, machine precision is used. For the 
result interval to get sharper, the approximation scheme is only slightly modi- 
fied but the target precision is increased and the arithmetic operations become 
increasingly slow. Cases where a high precision result is needed are very infre- 
quent and the running time seen by a user will most probably be the time for 
the first or the second iteration. 

Adaptable behaviour is sometime recreated with Mathematica or Maple. 
These two softwares let the user change the working precision and they pro- 
vide some criteria to estimate the final accuracy of many numerical routines 
yi4|. Users set the number of digits of the intermediate precision after a few 
educated random attempts. 

Prior art 

Adaptability Some few authors already presented a general purpose adapt- 
able arithmetic library [^ |8[ ^ |[ ^ ^^, but some others stopped with one 



or two adaptable attempts followed by an exact evaluation [|24[ |50|] . These last 
solutions are appropriate in many cases because the first few attempts are fast 
and they succeed by far in the most frequent case for many applications [ pi| ] . 
These authors where forced to stop because they used a data structure that is 
not appropriate for adaptability. They forced one or two rounds of adaptability 
with great efforts. 



Floating point expansion This data structure was introduced by Priest 
based on some earlier operations [B^ B3] . Actually, Moller and Dekker were the 
first in the past to propose such techniques to extend precision on the floating 



point unit [42, il|,|2^]. Kahan and Pichat applied similar techniques for other 
purposes [}32|, 44|]. 

Fast algorithms, tightly connected to the machines, are possible with lEEE- 
754 compatible commercial floating point units. Shewchuk was the first one 
to present a working library available on the net with an actual application 



to computational geometry [toL pOl]. Assuming that the floating point unit is 
IEEE compatible gives us a powerful set of axiomatic properties on floating 
point operations. We have proposed in the past some algorithms and their 
implementation for multiplication and division on expansions [ ]l5| , la O] . De- 
velopments are still undergone as new research teams get interested by this 
approach [}47|]. Our contribution forces us to define a new kind of expansions. 
We call them pseudo-expansions. 

On-line arithmetic We will present in this text that we can reduce the growth 
of the running time by setting up an appropriate data structure that will not 
restart an adaptable evaluation from scratch as the intermediate precision is 
modified. 

This data structure is inspired by on-line arithmetic w^here adaptability is 
natural. On line arithmetic was targeted in the past to hardware design with 
small radices (typically 2 or 4) [p6[ |lj, |6 , 19 , 1^ but it has proved to be suitable 



for software applications 

Contributions 

The first contribution of this work is a set of primitives available soon on the In- 
ternet through our home web siteR and through the NETLIB repositoryn A toy 
example shows that high level programming with object interface is too slow 
for the kernel of numerical softwares [2q]. We ran the same FLOPS routines 
three times [E2l ESp . The first run was performed with object automatic alloca- 
tion and destruction. The second run used in-line function calls and no object 
allocation. The final code used explicit code in-lining. Removing object pro- 
gramming increased the performance of 133% and using code in-lining yielded 
another 56%. Our contributed building blocks are pieces of code assembled by 
the designer according to its needs. They present a straight interface to get a 
good trade off between simplicity and speed. 

The second contribution is a set of validated proofs. To mechanise our 



proofs, we have been using the Coq proof assistant [ 31 1 . Systems like Coq allow 
the user to define new objects and to derive consequences of these definitions 
formally. The language of Coq is based on a higher-order logic. With such an 
expressive logic, it is possible to state properties in their most general form. 
For example, universal quantification has been used to state properties that are 
true for an arbitrary format or an arbitrary rounding mode. Proofs are built 
interactively using high-level tactics. At the end of each proof, Coq records 
a proof object that contains all the details of the derivation and ensures that 
the theorem is valid. Theorem provers have already been successfully used to 
mechanically check the correctness of floating-point algorithms [ ^ p9[ |. 

Our formal development is freely available on the Internet[|and it will soon 
be available as a Coq contribution^ A clickage map of the hierarchy makes it 
possible to browse into the different components. At the moment, it is 22000 
line long and it contains 106 Definitions and 545 Theorems. No proof is pre- 
sented in detail in this manuscript, but we outline the main results used for 
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the proofs. Since the proofs are mechanically validated, the reader is sure that 
the annoying special cases have been checked and that all the conditions of the 
properties used in the proof have been met. For each mentioned theorem we 
give its name as it appears in the formal development and the file where it is 
proved. 

We first present definitions and validated properties for the floating point 
numbers, the expansions and the pseudo-expansions. Section 2 presents the 
addition, multiplication and division toolset. This presentation ends with con- 
cluding remarks in the last Section. 

1 Definitions and validated properties 

1.1 Floating point numbers 

An IEEE double precision floating point number is built from 3 binary fields: 
the sign (1 bit), the fraction (52 bits) and the biased exponent (11 bits). Its nor- 
mal interpretation a; as a rational number is given below. 

mantissa = 1 .fraction 

exponent = biased exponent — bias 

X = (-l)sign X mantissa x 2*="?°"^"' 

We will use in this text the more general notation x = n x fi"^ io define a 
floating point number of radix /3 with two integers, the significand n and the 
amplitude e. This relation gives a signification to any pair [n, e). We will use 
the name floating point number or float to represent such a pair. Two floats 
(?i, e) and (n', e') are equivalent if they have the same value as real numbers, 
we write [n, e) = (n', e'). We extend this notation to the represented real value 
and we may write n x P^ = (n, e). We will also use the equivalence for ex- 
pression whenever there is no ambiguity. No order is defined on the floats and 
inequalities are given implicitly on their interpretation. 

Given a pair (rimax, fimin) £ N^, we say that a float (n, e) is bounded if and 
only if it satisfies 

\n\ < TT-max and - eniin < e. (1) 

We do not use the overflow bound of e^ax- The IEEE standard defines 
an overflow when the rounded value of the result exceeds the largest repre- 
sentable number It implies that the rounding is defined on a data type that 
does not have an upper boiind for the exponent. All the results of this text are 
true provided neither overflow nor exact infinity (division by 0) occurred in 
the computation. If this is not the case, infinity and not a number quantities will 
proliferate in the results. 

The bound for an IEEE standard inspired representation with p digits of 
mantissa and r digits of exponent is given Table ||. We will use the common 
fraction notation for the examples although the integer notation is used for the 
proofs. We use only radix independent generic IEEE standard inspired bounds 
in the rest of this text. A bounded float is normal if \n\ x (3 > rimax- It is 
de-normal if |n| x /3 < rimax- A de-normal float such that e = — e,nin is called 
subnormal. We define the ulp value as one unit in the last place of 1 that equals 

/3/(nmax + 1). 



Table 1: IEEE standard inspired bounds 
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Theorem 1 (FnormalizeCanonic and FcanonicUnique in Form) Any bounded 
float has one unique equivalent float {n, e) that is normal or subnormal. This float is 
the direct transcription of the number in machine with 



n 
e 



mantissa x (3^ ^ 
{biased exponent — biais) 



p+l. 



It is later referred in this text as the canonical representation. 

The IEEE standard describes four rounding modes but the rounding to 
the nearest floating point number is the rounding mode used by default in 
most computers. A rounding is generally defined as a monotonous projection 
onto the set of the canonical floats. Considering all the floats rather than just 
the canonical ones, we would rather define it as a projection onto subsets of 
bounded floats. 

For any real number x, we define its bounded floor \x\ as the class of the 
maximum bounded floats smaller than x. It means that a bounded float (n, e) 
is in \x\ if and only if, for all the bounded floats (ri', e') 



{n\e') <x 



(n',e')<(n,e) 



We define identically its bounded ceil \x~\ as the class of the minimum bounded 
floats larger than x. We define the truncated subset F{x) as follows. 



F{x) 




<e} 



x > 
X < 
x = 



We also define the class of the bounded floats nearest to x as □ (x) with (n, e) 
is in n(x) if and only if, for all the bounded floats (n', e') 

|(n, e) — x| < |(n', e') — x\ 

When n(x) contains two distinct equivalence classes of bounded floats, we 
define o(x) as the class of □(x) of the elements where the canonical represen- 
tation has an even significand. In other cases, o(x) = n(x). The following 
theorem states that this definition is appropriate. 



Theorem 2 (*RoundedModeP in Fround and Closest) The relations rounding 
down [-J , up [•] , to zero F{-) and to the nearest (even) o(-) define each a total monotonous 
projection from the reals onto the hounded floats compatible with the interpretation of 
the floats. 

We now give a few results on the rounding to the nearest. The theorem 
could be defined on the canonical representation but it is better to prove it for 
any bounded float. 

Theorem 3 (ClosestErrorBound in ClosestProp) Let x be any real number and 
(n, e) any float ofn(x), 

Ix-ufS^l </372. 

That relation is independent of the rounding being even or not. Even rounding provides 
that if both in, e) e o(.t) and \x — {n, e)| = P'^/2 then n is even. 

The following theorem gives a relation for every representation of the rounded 
value and the error provided that the error is non zero and it does have a 
bounded representation. The universal quantifier is the key to prove the theo- 
rems and |0[ 

Theorem 4 (RoundedModeErrorExpStrict in FRoundProp) Given an arbitrary 
rounding mode, let xbea real number and {n, e) a bounded float rounded from x, such 
that X is not represented by (n, e) (ie. x ^ nfi^). For all bounded float (n', e') that is 
a representation ofx — nfi'^, 

e < e. 

The bound is tight as in' , e') = (1, e — 1) is an acceptable round-off error. 

The result of any implemented operation, namely the addition, the multi- 
plication, the division and the square root extraction, is the rounded result of 
the exact mathematical operation. For example, if a © 6 is the machine addi- 
tion then a ®h G o[a + h). The machine operation could have been defined 
to return the canonical float in o(a + fe). We preferred to define an acceptable 
implementation as a function that return any bounded float in o(a + 6). 

When a theorem is independent of the tie breaking rule implemented, we 
use the boxed symbols E3, H and K for the addition, the subtraction and the 
multiplication in the hypothesis. Figure |l| presents the symbol of the four stan- 
dard floating point operators used in this work. All the figures use even round- 
ing as no other implementation exists. 

1.2 Exact operations 

We used for our proofs of the exact operation a result first published in [pSp. 
It was later presented in [|^ for the radix 2 notation. This fact is true for any 
bounded representation with any radix. 

Theorem 5 (Sterbenz in Fprop) Given two bounded floats x and y such that | < 
X < 2y, the rational x — y can be represented by a bounded float. 

We immediately prove the next result from its negation after setting (3 to be 
2. We use it later in our proofs. 



Addition b Multiplication b 

a u} o(a + 6) a HaK o(a x b) 

Subtraction b Division b 

a -^ 



:(a - b) 



^{a ^ b) 



Figure 1: Standard floating point operations rounded to the nearest (even) 
value. 
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Figure 2: Exact sum 



Theorem 6 (plusClosestLowerBound in Closest2Plus) Given two bounded floats 

radix 2, x and y, such that x + y ^ x®y,we know that \x®y\ > max(|a;|, |y|)/2. 

Meller first, then Knuth proposed the common exact sum of Figure ||-(a) 
involving 4 floating point additions and 2 floating point subtractions on an 
IEEE compliant computer The result is a pair of floats (a', b') such that a' = 
a®b and a' + b' = a + b. 

We have proved this algorithm correct (Theorem Knuth in TwoSum) us- 
ing the proof given by Priest in [E5|. Actually we will never use directly the 
algorithm but only the property that the sum and the rounding error can be 
represented. This is proved picking bounded representations that have special 
properties: 

Theorem 7 (plusExactExp in ClosestPlus) Given two bounded floats a = (ria, Ca) 
andb^ {rib, Cb), for alia' G aSb, zve can define {n,e) = a' andb' = {n',e'), bounded 
floats, such that 



b' 



, + b and inin(ea, Cb) = e' < e < niax(ea, et,) + 1. 



This theorem uses the Theorem S and Theorems errorBoundedPlus, plus- 
ExpBound not presented here to define an existing representation of a' and b'. 
The theorem is used to connect both results since it is true for any represen- 
tation of a' and b'. The special case where fe' = is handled independently. 



The last Theorem prompts the fact that the addition is accurate up to one 
unit in the last place. Such result is needed to show that the numerical tools of 
the next Section actually do some work. 

Theorem 8 (plusErrorBoundUlp in ClosestPlus) Given two bounded floats a and 
b,for all a' & aSb such that a' ^ 0, we know that 

\a + b-a'\< \a'\— < max(|a|, |6|) iilp . 

The last inequality is true only working radix 2. It is proved in theorem plusError- 
Boundl in file ClosestlPlus. 

It has also been proved by Moller and presented in Knuth's second edition 
that one can get a correct pair (a', 5') by an early exit of the exact sum provided 
\b\ < \a\. The conditional sum of Figure ||-(b) returns an exact pair {a' , b") = 
(a', b') with only 2 floating point additions and one floating point subtraction. 

Knuth suggested that the early exit is still valid if a and b are both canonical 
and share the same exponent. We prove here a result rephrased from yiq ] that 
a and 6 just have to be bounded and the exponent be in order. 

Theorem 9 (ExtDekker in EFast2Sum) On a radix 2 IEEE inspired floating point 
unit, given two hounded floats a = {na,ea) and b = (?i;,,eh), the conditional sum 
presented Figure^-(b) returns an exact pair {a (B b, a + b — a (B b) provided 

This theorem is proved only for radix two notations as it is not necessarily 
correct for other radices. An example where the early exit is not correct radix 
10 is given for rtmax = 99 with both inputs being 9.9. The theorem is correct for 
radix 2 and 3 Pq]. One can check one last example radix 4: rimax = 3 and both 
inputs are 3. 

An exact multiplication is also available. It computes a pair (a', b') such that 
a' = a®b and a' + b' ^ a x b with 7 floating point multiplications, 5 floating 
point additions and 5 floating point subtractions. These operators are surveyed 
in [p5|]. Here are some properties obtained from the desired specification of a' 
and b', not from the algorithm. The condition on the exponent is not necessary 
but it is sufficient to discard cases of underflow. 

Theorem 10 (multExactExpCan in ClosestMult) Given two hounded floats a = 
(ria, Ba) and b = {nb, cb), with Ca + Cb > — emin + p,for all a' rounded to a nearest 
float ofaxb, we can define {n, e) = a' canonic and b' = (n', e') bounded float, such 
that 

a' + b' = a X b and e' = e — p. 

It was proved that comparable error quantities, that always fit in a common 
floating point number, may be defined for the division and square root [R]. 

1.3 Floating point expansions 

A floating point expansion is defined as a finite sequence x = {xq ,xi,---,Xn-i) 
of floating point numbers. The value represented is the exact, not rounded. 



sum of its components ^ Xi . The length of the expansion is the number of its 
components. 

Any component of an expansion may be equal to zero, but the subsequence 
of the non-zero components Xi must be non overlapping and ordered by mag- 
nitude. The non overlapping condition means that two components cannot 
have significant bits with the same weight that is specified for each i > there 
exists a bounded float (n, e) such that 

Xt = {n,e) and |xi+i| < /?". 



We have already presented two examples in [hsp showing that any float 
{n, e) where e > — e,„in can be expressed as an expansion. 

As noted in the introduction, previous works by Priest, Shewchuck and 
ourselves have produced arithmetic operators on expansions and useful prim- 
itives for computational geometry. In this process, we recognized that the op- 
erators for the addition and the multiplication computing the least significant 
digits first do not produce length optimal results but tend to break the expan- 
sion in a large number of components with small significands. A compression 
routine is used to group together the components and reduce the length of the 
result expansion. 

It is possible to compute the arithmetic operations most significant digits 
first and compute directly components with a large significand. The common, 
both restoring and non restoring, division operators compute the components 
of the quotient like this. As a drawback, some of the components may overlap 
slightly. In our former work [|7|, the sequence of the quotient digits is cleaned 
by a compression routine that produces an acceptable expansion. 

We define pseudo-expansions as slightly overlapping expansions. The con- 
dition on the subsequence of non zero components as that 

\xi+i\ < e\xi\. 

We do not give the value of e < i now as it will be fixed in the next section 
considering the numerical algorithms. We deduce that 



Conversely, given a sequence w^here 
for all i, we deduce that 

2 Toolset for the addition, the multiplication and 
the division 

We combine in this Section tools to create the arithmetic operations: MQ, PP, 
PQ and S3. The tools have been implemented in C. Among them, the numeri- 
cal behavior of S3 alone is difficult to analyze as the three other tools only sort 
and produce components as they are needed. 
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Figure 3: Representation of the addition. 




Figure 4: Definition of the S3 operator. 



2.1 Addition 



Figure || represents the algorithm implemented to compute the sum of pseudo- 
expansions A and B. We use two primitive operators MQ (merge queue) and 
S3 (insert and sum of Figure ^. Arrows represent streams of floating point 
number flowing most significant digits first. The components of the result are 
in the stream named C in Figure |[ The output is only a pseudo-expansion 
since the a' are sorted by decreasing order of magnitude but they may partially 
overlap each other. 

MQ is an operator fired only if it has a value on both inputs or if the flow 
of one input is finished and it has a value on the other input. The largest input, 
in magnitude, is then transmitted to the output. Cases of tie are not critical. 
The MQ operator merges two flows sorted by magnitude and produces a sin- 
gle flow once again sorted by magnitude. The following theorem states that 
this flow of values can be represented by a flow of floats to be fed into the S3 
operator. 

Theorem 11 (IsRleExpRevIsExp in FexpAdd) Given a list of floats {xi)i^L sorted 
by magnitude, we can define the bounded floats {ni,ei) equivalent to the XiS such that 
the list {ei)i^L is sorted. 

We deduce the correct behavior of the operation by induction from the fol- 
lowing Theorem. The property on the input floating point number c is inher- 
ited from the fact that the two input streams A and B are pseudo-expansions 
merged by the MQ operator by order of magnitude. 
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Theorem 12 (boundSSum and expSSum in ThreeSum2) Let a = {ua, ea), b = 
{nb,eb) and c = (jicBc) be the bounded inputs of the E3 operator (see Figure ^. 
Provided that Ca > eb> Cc the E3 operator returns three numbers a', b', c' represented 
by (n^, e^), (n'j,, ej,) fl«d (n^, e^) smc/z t/zflf e^ > ej, > e'^ = Cc andfinaly either c' = 
or 

\b' + c'\ <3ulp|a'|, 

If a = or |5| + Tiniax/?"^" < '^max/3'^°/ ^'^ f^z^ exflct additions can use the early exit. 

If c' equals to zero, a' and 5' are not relevant enough. They remain in the 
operator as a and 6 for the next iteration. On the opposite, if c' is not equal to 
zero, a' is relevant. It is one component of the result, h' and c' are kept in the 
operator for the next iteration. 

Proof sketch: We first define u, v, a', h' and c' as they are computed in Fig- 
ure H We are interested in the case where c' ^ 0. It implies that w ^ Q. We 
bound 

\M < ^\a'\ 

\v\ < ulpmax(|u|, \a\) 

Since w 7^ 0, we know that 

^ max(|M|,|a|) 
I I - 2 

The first bound follows since 

\b' + c'\ = \v + w\< \v\ + \w\. 
We continue for the second bound 

3ulp|a'|>|5' + c'|>|6'|-|c'|> (l - ^) l^'l- 
We conclude with the representation of Theorem 



2 ' ' - 2 - ulp ' 



D 



The last theorem does the induction over time. It is written from both state- 
ment and proof of the theorem in Coq. 

Theorem 13 (FexpAdd in FexpAdd) Given a list (ui, eiji^L ofjj^L bounded floats 
so that {ei)i^L is sorted the addition operator (see Figure^ produces a list {ci)i^L of 
at most (#L + 1) floats with 

6#L + 6 
Cj+i| < e|ci| with e < 



X - 1 - 6#L 
under the condition that e < 1. 



12 



Pij = ai® bj 




key: \pij\ulp/2 



Figure 5: Representation of the multiplication. 

Proof sketch: We have to show that the input conditions are met at each iter- 
ation using pre-conditions and post-conditions. In the same time, we establish 
that 

l5]c,|<3(l + 2#L)ulp|c,|, 

and 

|c,+i|<6(l + 2#L)ulp|c,|. 

Being more careful, we proved the tighter bound in Coq. 

a 



2.2 Multiplication 

The multiplication of the expansions A (size ua) by the expansions B (size ns) 
generates the ua x ub partial products and computes their sum. The problem 
is to sort the partial products a^ x bj knowing that the lists (a;) and (bj) are 
sorted by magnitude. It is related to X + Y sorting p8[ p3] . Figure S represents 
the algorithm implemented. We use new and extended primitives: PP, a partial 
product generator, PQ and a priority queue extending the MQ. 

Working with pseudo-expansions, we have to make sure that the list of the 
partial products is sorted by magnitude. As bj arrives at position j, the PP tool 
initializes the ij index to and computes a^ (E) bj. Its magnitude is inserted at 
the end of PQ that is updated in [log jj operations. When the partial product 
a.i bj is output by PQ, the tool increments ij . The generation is frozen until 
both components a^ and bj^i are known. Then the cell containing ja; (g) bj \ in 
PQ is updated in [log jj operation and the tools can generate the next partial 
product. 

Numbers flow by pair since a; x bj produces two floating point numbers. 
The upper part pij = a^ bj is used immediately. The lower part a, x bj - 
Pij is put into a waiting queue. The two streams are merged together by the 
MQ operators. It is fired as long as it has an input on both entries. It does 
not compare the values but the keys that are associated with the floats. The 
operator does not produce a list of floats sorted by magnitude but directly a 
list sorted by amplitude as we use the second part of theorem ^ 
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do-D 

Q 




c > 4:\a + b\ulp 



key: \pij\ulp/2 



S. 



b' 



-E3 



y 

a'(Bb' 

do 



Figure 6: Representation of the non-restoring division. 



As Ui X bj — pij cannot be selected before pij, its insertion into the waiting 
queue is not in the critical path. Instruction reordering will allow these oper- 
ation to be hidden in dead cycles of the S3 operator. The waiting list cannot 
hold more than 2jib elements since the bound on e is a few ulps. 

2.3 Division 

Figure ^ represents the division of R by D, pseudo-expansions. We want to 
compute Q such that Q = R/D, ie. R — QD = 0. Each component g^ of Q is 
deduced from the previous ones by computing 



R, = R-Y,Qj xD, 



before evaluation qi « Ri/D. We will see later that do is handled separately so 
our scheme computes Q x (do — D). 

The goal is to ensure that the values c used as inputs of the E3 operator 
are sorted by amplitude. The PP-PQ-MQ chain is close to the one used for the 
multiplication in the Figure |[ The generation is frozen only when dj+i is not 
known. 

The test on qi is replaced by a new condition: the operation stops before it 
may insert a value c such that c < 4|a + &| ulp. A new approximate quotient 
digit qi is then guessed from a fair most significant component d^ of the divisor 
D and a fair most significant component a' © h' . 

The following theorem states that one division step is very robust. For a 
reasonable value of e, it guarantees that the remainder decreases almost lin- 
early. 

Theorem 14 (DivConv in FexpDiv) Let W and D he two non zero real numbers 
approximated by w and d with a relative error hound e < 1 and let q be an approxima- 
tion ofw/d with the same relative error hound. 



\W -qD\ <e 



1-e 



\W\ 



To reduce e, we replace the initial values do and di by do © di and do + di — 
do © di . As a consequence, the worst error on the Theorem 113 is the one on the 
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remainder and it is bounded by construction by 4 ulps. This guarantees with 
the loop condition that any term q^ x dj is smaller than all the terms already 
produced by the partial product generation chain except qi x do. This last value 
is handled separately as a' and b' are replaced by a' + b' — qida. The loop 



Conclusion 

We have presented a new adaptable numeric core inspired both from floating 
point expansions and from on-line arithmetic. Our choice was to present only 
the arithmetic core as we have presented in past publications how a core can be 
used efficiently to compute matrix determinants for example [ p7| ] . Building a 
general purpose runtime environment raises question in areas like parallelism 
(demand driven vs. data flow...) and in compiler techniques (efficiency of ob- 
ject oriented code generated by existing compilers...) among others. 

The numeric core is cut down to four tools. The S3 operator that contains 
many arithmetic operations is proved correct. The proofs have been validated 
by the Coq assistant. Developing the proofs, we have formally proved many 
result long published in the literature and we have extended a few of them. 
This work may let users 

• Develop application specific adaptable libraries based on the toolset. 

• Easily write new formal proofs based on the growing set of sensible vali- 
dated facts. 

The set of the validated facts is now sufficient to shield the user from the 
difficult implementation details of floating point arithmetic. The proof of the 
key Theorem on the E3 operator does not use any low level result. It just uses 
the many lower and upper bounds defined in the Theorems of Section 1 . 
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A A brief overview of Coq 



In this appendix, we give a quick overview of the Coq system. For a more complete 
introduction, we refer the reader to [Bit]. 

Coq is a generic prover based on type-theory In this system, users can define new 
objects and prove properties that derive logically from these definitions. Objects in Coq 
are typed and functions are first-class objects using a Lisp-like notation. The system is 
distributed with standard libraires that define types like nat, Z and R which correspond 
to the natural numbers, the relative numbers and the reals respectively. To give an ex- 
ample, the addition for natural number is represented by the function plus whose type 
is nat -^ nat -^ nat. A function plus3 that does the sum of 3 natural numbers is 
defined by the following command: 

Definition plus3 := [a,b,c:nat] (plus a (plus be)). 

Arguments between square brackets provide the parameters of the function. The type 
of plus 3 is the expected one: nat -^ nat -^ nat -^ nat. 

In our formalization, we first define the type float that represents the record com- 
posed of the mantissa and the exponent by the command: 

Record float: Set := Float {Fnum: Z; Fexp: Z}. 



18 



This command creates a new type float, a constructor Float of type Z — > Z ^ 
float and two destructors Fnum of type float — > Z and Fexp of type float — » 
Z. So for example, we can define the function Fzero that takes an object of type float 
and returns an object of same exponent but with the mantissa set to zero: 

Definition Fzero := [p : float ] (Float (Fexp p) ) . 

In a similar way, we define the notion of bound. It contains two integers: one for the 
mantissa and one for the exponent: 

Record Fbound: Set := Bound {vNum: nat;dExp: nat } . 

In order to prove properties that are independent of a particular bound, we add the 
command: 

Variable biFbound. 

With this arbitrary bound, the notion of being bounded is defined as: 

Definition Fbounded := [p: float] 

((Zle (Zopp (vNum b) ) (Fnum p) ) /\ (Zle (Fnum p) (vNum b) ) ) /\ 
(Zle (Zopp (dExp b) ) (Fexp p) ) . 

A pretty-printed version of this definition could be: 

Definition Fbounded := Ap: float. 
- (vNum b) < (Fnum p) < (vNum b) A -(dExp b) < (Fexp p) . 

Now that we have these definitions we can prove a very simple first theorem: 

Theorem BoundedZero : (p:float) (Fbounded p) -> (Fbounded (Fzero p) ) . 

A pretty-printed version of this theorem could be: 

Theorem BoundedZero: Vp: float. (Fbounded p) => (Fbounded (Fzero p) 

When the previous command is received by the system, it enters the proof mode. The 
statement of the theorem is put om the top of the goal stack. By applying a command 
called tactic we replace the goal at the top of the stack by a (possibly empty) list of sub- 
goals. The proof is finished when the stack is empty. Our initial stack has only one goal. 
Each goal contains a list of hypothesis and a conclusion separated by a bar: 

1 subgoal 



(p: float) (Fbounded p) -> (Fbounded (Fzero p) ) 

The first step is to introduce the hypothesis. For this, we apply the tactic Intros. 

Intros p H. 

We are now reduce to prove (Fbounded (Fzero p) ) in the contect where p is a 
float and H is a proof that p is bounded: 

1 subgoal 

p : float 

H : (Fbounded p) 



(Fbounded (Fzero p) ) 
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Next we expand the definition of Fbounded in the goal. For this, we use the tactic Red 
and get the new goal: 

1 subgoal 

p : float 

H : (Fbounded p) 



( (Zle (Zopp (vNum b) ) (Fnum (Fzero p) ) /\ 
(Zle (Fnum (Fzero p) ) (vNum b) ) ) /\ 
(Zle (Zopp (dExp b) ) (Fexp (Fzero p) ) ) 

This goal can be simplified using the definition of Fzero by the tactic Simpl: 

1 subgoal 

p : float 

H : (Fbounded p) 



((Zle (Zopp (vNum b) ) 0) /\ (Zle (vNum b) ) ) /\ 
(Zle (Zopp (dExp b) ) (Fexp p) ) 

We now need to break the conjunctions into separate subgoals. This is done by the tactic 
Split. As the conjunction is nested, this tactic needs to be applied repeatedly: 

Repeat Split. 

We get the expected three subgoals. 

3 subgoals 

p : float 

H : (Fbounded p) 



(Zle (Zopp (vNum b) ) 0) 

subgoal 2 is: 

(Zle (vNum b) ) 

subgoal 3 is: 

(Zle (Zopp (dExp b) ) (Fexp p) ) 



The first goal is simple enough to be proved automatically by the following tactic Intuition. 
We are left with two subgoals. 

2 subgoals 

p : float 

H : (Fbounded p) 



(Zle (vNum b) ) 

subgoal 2 is: 

(Zle (Zopp (dExp b) ) (Fexp p) ) 



Applying the same tactic to these two subgoals ends the proof. The final proof script 
looks like: 
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Theorem BoundedZero : (p:float) (Fbounded p) -> (Fbounded (Fzero p) ) 

Intros p H. 

Red. 

Simpl . 

Repeat Split. 

Intuition . 

Intuition . 

Intuition . 

Qed. 

Using the composition of tactics ";", it can be shorten to: 

Theorem BoundedZero: (p: float) (Fbounded p) -> (Fbounded (Fzero p) ) 

Intros p H; Red; Simpl; Repeat Split; Intuition. 

Qed. 
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